Representation Quality in Text Classification: An Introduction and Experiment

Author

David D. Lewis

Abstract

The way in which text is represented has a strong impact on the performance of text classification (retrieval and categorization) systems. We discuss the operation of text classification systems, introduce a theoretical model of how text representation impacts their performance, and describe how the performance of text classification systems is evaluated. We then present the results of an experiment on improving text representation quality, as well as an analysis of the results and the directions they suggest for future research.

1 The Task of Text Classification

Text-based systems can be broadly classified into classification systems and comprehension systems. Text classification systems include traditional information retrieval (IR) systems, which retrieve texts in response to a user query, as well as categorization systems, which assign texts to one or more of a fixed set of categories. Text comprehension systems go beyond classification to transform text in some way, such as producing summaries, answering questions, or extracting data.

Text classification systems can be viewed as computing a function from documents to one or more class values. Most commercial text retrieval systems require users to enter such a function directly in the form of a boolean query. For example, the query (language OR speech) AND AU = Smith specifies a 1-ary 2-valued (boolean) function that takes on the value TRUE for documents that are authored by Smith and contain the word language or the word speech.

In statistical IR systems, which have long been investigated by researchers and are beginning to reach the marketplace, the user typically enters a natural language query, such as "Show me uses of speech recognition." The assumption is made that the attributes (content words, in this case) used in the query will be strongly associated with documents that should be retrieved.
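The boolean query quoted above, (language OR speech) AND AU = Smith, can be read as an ordinary predicate over documents. The following is a minimal Python sketch of that reading, not a description of any particular retrieval system. The `Document` type, its `author` and `text` fields, the `matches_query` name, and the sample documents are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical document record; real systems store many more fields.
    author: str
    text: str


def matches_query(doc: Document) -> bool:
    """Evaluate (language OR speech) AND AU = Smith for one document."""
    # Assumption: words are whitespace-separated and compared case-insensitively.
    words = set(doc.text.lower().split())
    has_topic_word = "language" in words or "speech" in words
    return has_topic_word and doc.author == "Smith"


# Illustrative documents only; they are not taken from the paper.
docs = [
    Document("Smith", "Speech recognition applications"),
    Document("Jones", "Speech and speech based systems"),
    Document("Smith", "Trade show report"),
]
results = [matches_query(d) for d in docs]
```

The function returns one of two values for each document, which is what makes it a 2-valued classification function in the sense used here.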
A statistical IR system uses these attributes to construct a classification function, such as:

$$f(x) = c_1 x_{\mathrm{show}} + c_2 x_{\mathrm{uses}} + c_3 x_{\mathrm{speech}} + c_4 x_{\mathrm{recognition}}$$

This function assumes that there is an attribute corresponding to each word, and that attribute takes on some value for each document, such as the number of occurrences of the word in the document. The coefficients $c_i$ indicate the weight given to each attribute. The function produces a numeric score for each document, and these scores can be used to determine which documents to retrieve or, more usefully, to display documents to the user in ranked order:

Speech Recognition Applications    0.88
Jones Gives Speech at Trade Show    0.65
Speech and Speech Based Systems    0.57

Most methods for deriving classification functions from natural language queries use statistics of word occurrences to set the coefficients of a linear discriminant function [5,20]. The best results are obtained when supervised machine learning, in the guise of relevance feedback, is used [21,6].

Text categorization systems can also be viewed as computing a function defined over documents, in this case a k-ary function, where k is the number of categories into which documents can be sorted. Rather than deriving this function from a natural language query, it is typically constructed directly by experts [28], perhaps using a complex pattern matching language [12]. Alternately, the function may be induced by machine learning techniques from large numbers of previously categorized documents [17,11,2].

1.1 Text Representation and The Concept Learning Model

Any text classification function assumes a particular representation of documents. With the exception of a few experimental knowledge-based IR systems [15], these text representations map documents into vectors of attribute values, usually boolean or numeric.
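The two ideas described so far, a vector of attribute values per document and a weighted linear function over those values, can be sketched in a few lines of Python. This is an illustrative sketch only: the weights in `QUERY_WEIGHTS` are invented, the attribute values are assumed to be raw word counts, and the resulting scores are not the 0.88, 0.65, and 0.57 quoted in the text, which presumably come from a differently weighted and normalized function. The names `represent`, `score`, and `QUERY_WEIGHTS` are hypothetical.

```python
from collections import Counter

# Hypothetical coefficients for the query words; a real system would
# derive them from word-occurrence statistics or relevance feedback.
QUERY_WEIGHTS = {"show": 0.4, "uses": 0.2, "speech": 0.3, "recognition": 0.5}


def represent(text: str) -> Counter:
    """Map a text to attribute values: one count per lowercase word."""
    return Counter(text.lower().split())


def score(text: str) -> float:
    """Linear scoring function: sum of weight * word count over query words."""
    counts = represent(text)
    return sum(weight * counts[word] for word, weight in QUERY_WEIGHTS.items())


titles = [
    "Speech and Speech Based Systems",
    "Jones Gives Speech at Trade Show",
    "Speech Recognition Applications",
]

# Display documents in ranked order, highest score first.
ranked = sorted(titles, key=score, reverse=True)
```

With raw counts, a title that repeats a query word accumulates that word's weight more than once, so the choice of attribute values affects the ranking as much as the choice of coefficients does.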
For example, the document title "Speech and Speech Based Systems" might be represented as

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

A New Document Embedding Method for News Classification

Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

Palarimetric Synthetic Aperture Radar Image Classification using Bag of Visual Words Algorithm

Land cover is defined as the physical material of the surface of the earth, including different vegetation covers, bare soil, water surface, various urban areas, etc. Land cover and its changes are very important and influential on the Earth and life of living organisms, especially human beings. Land cover change monitoring is important for protecting the ecosystem, forests, farmland, open spac...

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...

Publication date: 1990
